Abstract —In this paper, we propose a multi-kernel classifier
learning algorithm to optimize a given nonlinear and nonsmooth
multivariate classifier performance measure. Moreover, to solve
the problem of kernel function selection and kernel parameter
tuning, we propose to construct an optimal kernel as a weighted
linear combination of candidate kernels. The learning of
the classifier parameter and the kernel weights is unified in
a single objective function that minimizes an upper bound
of the given multivariate performance measure. The
objective function is optimized with regard to the classifier parameter
and the kernel weights alternately in an iterative algorithm based on
the cutting plane algorithm. The developed algorithm is evaluated on
two different pattern classification problems with regard to various
multivariate performance measure optimization problems. The
experimental results show that the proposed algorithm outperforms the
competing methods.

Index Terms —Pattern recognition, multiple kernel, multivariate
performance measures, cutting plane algorithm

I. Introduction 

In different pattern classification problems, various performance
measures are employed to evaluate classifiers, including
classification accuracy (ACC), F1 score, Matthews Correlation
Coefficient (MCC), area under the receiver operating characteristic
(ROC) curve (AUC), and the recall-precision break-even
point (RP-BEP) of the recall-precision curve. Due to the nonlinear
and nonsmooth nature of many performance measures, it is
difficult to optimize them directly to learn an optimal classifier.
To solve this problem, Joachims [1] proposed a support
vector machine learning method for multivariate performance
measures (SVM$^{Perf}$). This method has been applied
successfully to optimize some nonlinear multivariate performance
measures when learning linear classifiers. However, it is limited
to the learning of linear classifiers. When data samples of
different classes cannot be separated by a linear boundary, it is
suggested to employ the kernel trick to map the data samples
to a nonlinear high-dimensional data space so that a linear
boundary can be learned [2], [3], [4]. Joachims and Yu [5]
also extended SVM$^{Perf}$ to a kernel version to handle
nonlinearly distributed data. One important shortcoming of this
method lies in the choice of an optimal kernel function
and its corresponding parameter. In [5], the RBF kernel is
applied to classification problems on some data sets without any
justification, and it is highly doubtful whether this kernel is suitable
for other data sets. Moreover, the choice of the kernel parameter
may influence the results significantly.
One possible way to solve this problem is to conduct an
exhaustive search or a cross validation over the kernel
function and parameter space by using the training set, which
is very time-consuming and may also make the learned classifier
over-fit the training samples.

To solve this problem, we assume that the desired kernel can
be obtained as a linear combination of candidate kernel
functions with different kernel parameters. The optimal kernel
is parameterized by the linear combination weights associated
with the different kernels. This framework is called Multi-Kernel
Learning (MKL) since we explore the nonlinear kernel spaces
of multiple kernels [6]. To learn the kernel weights, we combine
the MKL problem with the multivariate performance measure
optimization problem, and propose a unified learning problem for
both. For the first time, we pose the problem of learning an optimal
kernel for multivariate performance measures, and present a novel
solution to this problem that learns the kernel in multiple kernel
spaces while simultaneously optimizing multivariate performance
measures.

The rest of this paper is organized as follows: in
Section II we introduce the novel method by first formulating the
problem, then optimizing it, and finally developing an iterative
algorithm; in Section III the proposed method is evaluated
on some benchmark data sets; and in Section IV the paper
is concluded.

II. Proposed method 
A. Problem Formulation 

We assume we have a training data set with $n$ training
samples, organized in a training matrix
$X = [x_1, \cdots, x_n] \in \mathbb{R}^{d \times n}$, where the $i$-th column $x_i$
is the $d$-dimensional feature vector of the $i$-th training sample.
Moreover, we also organize the class labels in a class label
vector $y = [y_1, \cdots, y_n]^\top \in \{+1,-1\}^n$, where $y_i \in \{+1,-1\}$
is the binary class label of the $i$-th training sample. Under
the framework of kernel learning [6], a sample vector can
be mapped into a high-dimensional nonlinear Hilbert space
via an implicit mapping function $\phi : x \to \phi(x) \in \mathbb{R}^{d'}$,
where $d' \gg d$ is the dimension of the Hilbert space. The
mapping function is explored by a kernel function, which is
defined as the dot product of the mappings of two samples
$x_i$ and $x_j$, as $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$. In the multi-kernel
learning framework, we may have several such Hilbert spaces
available, and their corresponding nonlinear mapping functions
are denoted as $\{\phi_m(x) \in \mathbb{R}^{d'_m}\}_{m=1}^M$, where $M$ is the number
of Hilbert spaces, $\phi_m(x)$ is the nonlinear mapping function
of the $m$-th Hilbert space, and $d'_m$ is the dimension of
the $m$-th Hilbert space. We also define the kernel function
for the $m$-th Hilbert space as $K_m(x_i, x_j) = \phi_m(x_i)^\top \phi_m(x_j)$.

We weight and concatenate the mapping functions to form
a longer vector in a more general Hilbert space,
$\phi_\tau(x) = [\tau_1 \phi_1(x)^\top, \cdots, \tau_M \phi_M(x)^\top]^\top \in \mathbb{R}^{d'}$,
where $\tau_m \in \mathbb{R}_+$ is the nonnegative weight for the $m$-th
Hilbert space, $\tau = [\tau_1, \cdots, \tau_M]^\top \in \mathbb{R}_+^M$ is the weight
vector, and $d' = \sum_{m=1}^M d'_m$ is the dimension of the general
Hilbert space. Its corresponding kernel function is given as

$$
K_\tau(x_i, x_j) = \phi_\tau(x_i)^\top \phi_\tau(x_j) = \sum_{m=1}^M \tau_m^2 K_m(x_i, x_j). \tag{1}
$$

It can be seen that the kernel function is also a weighted
linear combination of the $M$ kernel functions of the $M$
Hilbert spaces, with weights $\tau_m^2$. We map all the samples to the
general Hilbert space and organize the mapping results in a
$d' \times n$ matrix as
$\phi_\tau(X) = [\phi_\tau(x_1), \cdots, \phi_\tau(x_n)] \in \mathbb{R}^{d' \times n}$.
We can also apply the kernel function to the matrix and obtain the
$n \times n$ kernel matrix
$K_\tau(X, X) = \sum_{m=1}^M \tau_m^2 K_m(X, X) \in \mathbb{R}^{n \times n}$, where
$K_m(X, X) = [K_m(x_i, x_j)] \in \mathbb{R}^{n \times n}$ is the kernel matrix of
the $m$-th Hilbert space.
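
As a concrete illustration of the combined kernel matrix, the following minimal sketch builds $K_\tau(X, X)$ from a few RBF base kernels. The squared weights follow from stacking $\tau_m \phi_m(x)$ in the concatenated mapping. The helper names (`rbf_kernel`, `combined_kernel`), the bandwidth values, and the random toy data are assumptions made only for this illustration, not settings reported in this paper.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian RBF kernel matrix for samples stored as columns of X (d x n)."""
    sq_norms = np.sum(X ** 2, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X.T @ X
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def combined_kernel(kernels, tau):
    """K_tau = sum_m tau_m^2 K_m for nonnegative kernel weights tau."""
    tau = np.asarray(tau, dtype=float)
    if np.any(tau < 0):
        raise ValueError("kernel weights must be nonnegative")
    return sum(t ** 2 * K for t, K in zip(tau, kernels))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                       # d = 3 features, n = 5 samples
kernels = [rbf_kernel(X, g) for g in (0.1, 1.0, 10.0)]  # assumed bandwidths
tau = np.array([0.2, 0.5, 0.3])                   # nonnegative, sums to one
K_tau = combined_kernel(kernels, tau)             # n x n, symmetric, PSD
```

Because every base kernel matrix is positive semidefinite and the weights enter as nonnegative factors, the combined matrix is again a valid kernel matrix.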

We consider the problem of learning a hypothesis function
$h_w(X)$ which maps a tuple of $n$ samples organized in a data
matrix $X$ to a label vector of $n$ labels $y$. To this end, we first
map the data matrix $X$ to the general Hilbert space as $\phi_\tau(X)$,
and then apply a linear discriminant function of the following
form,

$$
h_w(X) = \arg\max_{y' \in \{+1,-1\}^n} w^\top \phi_\tau(X) y'
       = \arg\max_{y' \in \{+1,-1\}^n} \sum_{i=1}^n w^\top \phi_\tau(x_i) y'_i, \tag{2}
$$

where $w \in \mathbb{R}^{d'}$ is the parameter vector. In fact, this is
equivalent to the following prediction rule,

$$
h_w(X) = \mathrm{sign}\left( w^\top \phi_\tau(X) \right), \tag{3}
$$

where $\mathrm{sign}(\cdot)$ is an element-wise sign function.
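
The equivalence between the argmax form and the element-wise sign rule can be checked numerically for a small $n$. The sketch below is only an illustration: the `scores` vector stands in for the per-sample values $w^\top \phi_\tau(x_i)$, and both function names are hypothetical.

```python
import itertools
import numpy as np

def predict_by_enumeration(scores):
    """Argmax form: try every y' in {+1,-1}^n and keep the best one."""
    best, best_val = None, -np.inf
    for cand in itertools.product([1, -1], repeat=len(scores)):
        val = float(np.dot(scores, cand))
        if val > best_val:
            best, best_val = np.array(cand), val
    return best

def predict_by_sign(scores):
    """Sign form: label each sample by the sign of its own score."""
    return np.where(scores >= 0, 1, -1)

scores = np.array([0.7, -1.2, 0.1, -0.4, 2.3])    # stand-in for w^T phi_tau(x_i)
assert np.array_equal(predict_by_enumeration(scores), predict_by_sign(scores))
```

The sum over samples decouples, so each label can be chosen independently; this is why the exponential search collapses to $n$ sign evaluations.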


To avoid the over-fitting problem, we try to reduce the
complexity of the hypothesis function parameter $w$ by minimizing
its squared $\ell_2$ norm,

$$
\min_{w} \left\{ \frac{1}{2} \|w\|_2^2 = \frac{1}{2} w^\top w \right\}. \tag{4}
$$

We also want to reduce the prediction error of the hypothesis
function on the training set. To measure the prediction error,
a loss function can be applied to compare the true class
label tuple $y$ against the output of the hypothesis function
$h_w(X)$. The following optimization problem is obtained with
a loss function $\Delta(y, h_w(X))$,

$$
\min_{w} \Delta(y, h_w(X)). \tag{5}
$$

Instead of trying to optimize $\Delta(y, h_w(X))$ directly, we
find an upper bound of it and then minimize this upper bound.
Given (2), we have the following inequalities,

$$
\begin{aligned}
& w^\top \phi_\tau(X) h_w(X) \geq w^\top \phi_\tau(X) y', \quad \forall y' \in \{+1,-1\}^n \\
\Rightarrow\; & \Delta(y, h_w(X)) + w^\top \phi_\tau(X) \left( h_w(X) - y \right) \geq \Delta(y, h_w(X)).
\end{aligned}
\tag{6}
$$

Thus we have an upper bound of $\Delta(y, h_w(X))$, and the
optimization problem in (5) can be relaxed to

$$
\min_{w} \left\{ \Delta(y, h_w(X)) + w^\top \phi_\tau(X) \left( h_w(X) - y \right) \right\}. \tag{7}
$$

We further relax the minimization of
$\Delta(y, h_w(X)) + w^\top \phi_\tau(X) (h_w(X) - y)$ to the minimization of its
upper bound, which can be obtained by exploring the class label
tuple space excluding $y$, $y'_l \in \mathcal{Y}/y$,

$$
\Delta(y, h_w(X)) + w^\top \phi_\tau(X) \left( h_w(X) - y \right)
\leq \max_{l: y'_l \in \mathcal{Y}/y} \left[ \Delta(y, y'_l) + w^\top \phi_\tau(X) (y'_l - y) \right]. \tag{8}
$$

Thus we can translate the problem in (7) to

$$
\min_{w} \left\{ \max_{l: y'_l \in \mathcal{Y}/y} \left[ \Delta(y, y'_l) + w^\top \phi_\tau(X) (y'_l - y) \right] \right\}. \tag{9}
$$

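It can be further relaxed by introducing a nonnegative slack
variable $\xi$ to represent the upper bound, so that the problem
can be rewritten as

$$
\begin{aligned}
\min_{w, \xi} \quad & \xi \\
\text{s.t.} \quad & \Delta(y, y'_l) + w^\top \phi_\tau(X) (y'_l - y) \leq \xi, \quad \forall l: y'_l \in \mathcal{Y}/y, \\
& \xi \geq 0.
\end{aligned}
\tag{10}
$$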

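Combining the problems in (4) and (10), and introducing
constraints on $\tau$ to prevent negative kernel weights, we obtain
the following overall optimization problem,

$$
\begin{aligned}
\min_{w, \xi, \tau} \quad & \frac{1}{2} w^\top w + C \xi, \\
\text{s.t.} \quad & \Delta(y, y'_l) + w^\top \phi_\tau(X) (y'_l - y) \leq \xi, \quad \forall l: y'_l \in \mathcal{Y}/y, \\
& \xi \geq 0, \quad \sum_{m=1}^M \tau_m = 1, \quad \tau_m \geq 0, \; m = 1, \cdots, M,
\end{aligned}
\tag{11}
$$

where $C$ is a tradeoff parameter.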
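
The chain of relaxations from (5) to (10) can be sanity-checked on a toy problem by enumerating every label tuple. The sketch below is only an illustration under stated assumptions: it uses $1 - F_1$ as the loss $\Delta$, takes the full set of tuples other than $y$ as $\mathcal{Y}/y$, and uses a random `scores` vector as a stand-in for $w^\top \phi_\tau(x_i)$. The function names are hypothetical.

```python
import itertools
import numpy as np

def f1_loss(y, y_pred):
    """Delta(y, y') = 1 - F1, computed from the contingency table."""
    tp = np.sum((y == 1) & (y_pred == 1))
    fp = np.sum((y == -1) & (y_pred == 1))
    fn = np.sum((y == 1) & (y_pred == -1))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 1.0 - 2.0 * tp / denom

def slack_upper_bound(y, scores, loss):
    """Smallest feasible xi in (10): max(0, max_{y' != y} [loss + s^T (y' - y)])."""
    best = 0.0
    for cand in itertools.product([1, -1], repeat=len(y)):
        c = np.array(cand)
        if np.array_equal(c, y):
            continue                      # y itself is excluded from the set
        best = max(best, loss(y, c) + float(scores @ (c - y)))
    return best

rng = np.random.default_rng(1)
y = np.array([1, 1, -1, -1, -1, 1])
scores = rng.normal(size=y.size)          # stand-in for w^T phi_tau(x_i)
h = np.where(scores >= 0, 1, -1)          # prediction by the sign rule
xi = slack_upper_bound(y, scores, f1_loss)
assert xi >= f1_loss(y, h)                # the slack upper-bounds the training loss
```

The enumeration costs $2^n$ loss evaluations, which is why the constraint set has to be built incrementally in practice rather than listed in full.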

B. Optimization 

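To optimize this problem, we give the primal Lagrangian
function as follows,

$$
\begin{aligned}
\mathcal{L}(w, \xi, \tau, \alpha, \beta, \gamma, \delta) = {} & \frac{1}{2} w^\top w + C \xi \\
& + \sum_{l: y'_l \in \mathcal{Y}/y} \alpha_l \left( \Delta(y, y'_l) + w^\top \phi_\tau(X) (y'_l - y) - \xi \right) \\
& - \beta \xi - \gamma \left( \sum_{m=1}^M \tau_m - 1 \right) - \sum_{m=1}^M \delta_m \tau_m,
\end{aligned}
\tag{12}
$$

where $\alpha_l \geq 0$, $\beta \geq 0$, $\gamma \geq 0$ and $\delta_m \geq 0$ are the Lagrange
multipliers. We obtain the following dual optimization problem,

$$
\begin{aligned}
\max_{\alpha, \beta, \gamma, \delta} \; \min_{w, \xi, \tau} \quad & \mathcal{L}(w, \xi, \tau, \alpha, \beta, \gamma, \delta) \\
\text{s.t.} \quad & \alpha_l \geq 0, \; \forall l: y'_l \in \mathcal{Y}/y, \\
& \beta \geq 0, \; \gamma \geq 0, \; \delta_m \geq 0, \; m = 1, \cdots, M.
\end{aligned}
\tag{13}
$$

By setting the derivatives of the Lagrangian function with regard
to $w$ and $\xi$ to zero, we have

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial w} = 0 \;&\Rightarrow\; w = \sum_{l: y'_l \in \mathcal{Y}/y} \alpha_l \phi_\tau(X) (y - y'_l), \\
\frac{\partial \mathcal{L}}{\partial \xi} = 0 \;&\Rightarrow\; C - \sum_{l: y'_l \in \mathcal{Y}/y} \alpha_l - \beta = 0 \;\Rightarrow\; C \geq \sum_{l: y'_l \in \mathcal{Y}/y} \alpha_l.
\end{aligned}
\tag{14}
$$

By substituting these results and the kernel definition in (1) into
(12), we obtain the dual Lagrangian function,

$$
\begin{aligned}
\mathcal{P}(\tau, \alpha, \gamma, \delta) = {} & -\frac{1}{2} \sum_{l,k: y'_l, y'_k \in \mathcal{Y}/y} \alpha_l \alpha_k \left( (y - y'_l)^\top \left( \sum_{m=1}^M \tau_m^2 K_m(X, X) \right) (y - y'_k) \right) \\
& + \sum_{l: y'_l \in \mathcal{Y}/y} \alpha_l \Delta(y, y'_l) - \gamma \left( \sum_{m=1}^M \tau_m - 1 \right) - \sum_{m=1}^M \delta_m \tau_m.
\end{aligned}
\tag{15}
$$

This optimization problem is then transformed to

$$
\begin{aligned}
\min_{\tau} \; \max_{\alpha, \gamma, \delta} \quad & \mathcal{P}(\tau, \alpha, \gamma, \delta) \\
\text{s.t.} \quad & \alpha_l \geq 0, \; \forall l: y'_l \in \mathcal{Y}/y, \quad C \geq \sum_{l: y'_l \in \mathcal{Y}/y} \alpha_l, \\
& \gamma \geq 0, \; \delta_m \geq 0, \; m = 1, \cdots, M.
\end{aligned}
\tag{16}
$$

To solve this problem, we adopt an alternate optimization
strategy. In an iterative algorithm, $\alpha$ and $\tau$ with its Lagrange
multipliers $\gamma$ and $\delta$ are optimized alternately.

• Optimizing $\alpha$: By fixing $\tau$ with its Lagrange multipliers $\gamma$
and $\delta$, and only considering $\alpha$, the optimization problem
in (16) is reduced to

$$
\begin{aligned}
\max_{\alpha} \quad & -\frac{1}{2} \sum_{l,k: y'_l, y'_k \in \mathcal{Y}/y} \alpha_l \alpha_k \left( (y - y'_l)^\top K_\tau(X, X) (y - y'_k) \right) + \sum_{l: y'_l \in \mathcal{Y}/y} \alpha_l \Delta(y, y'_l) \\
\text{s.t.} \quad & \sum_{l: y'_l \in \mathcal{Y}/y} \alpha_l \leq C, \quad \alpha_l \geq 0, \; \forall l: y'_l \in \mathcal{Y}/y.
\end{aligned}
\tag{17}
$$

This problem can be solved as a quadratic programming
problem.

• Solving $\tau$: By fixing $\alpha$, and only considering $\tau$ and its
Lagrange multipliers $\gamma$ and $\delta$, we have the following
problem,

$$
\begin{aligned}
\min_{\tau} \; \max_{\gamma, \delta} \quad & \Bigg\{ -\frac{1}{2} \sum_{l,k: y'_l, y'_k \in \mathcal{Y}/y} \alpha_l \alpha_k (y - y'_l)^\top \left( \sum_{m=1}^M \tau_m^2 K_m(X, X) \right) (y - y'_k) \\
& \quad - \gamma \left( \sum_{m=1}^M \tau_m - 1 \right) - \sum_{m=1}^M \delta_m \tau_m \Bigg\} \\
\text{s.t.} \quad & \gamma \geq 0, \; \delta_m \geq 0, \; m = 1, \cdots, M.
\end{aligned}
\tag{18}
$$

This is the dual form of a constrained quadratic programming
problem, and we can solve it as a constrained
quadratic programming problem.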

• Updating $\mathcal{Y}/y$: It should be noted that the
construction of the set $\mathcal{Y}/y$ is also a problem. To this end,
we propose to construct $\mathcal{Y}/y$ sequentially in the iterative
algorithm, by adding one new class label tuple to $\mathcal{Y}/y$ in
each iteration according to the updated $w$ and $\tau$,

$$
y^* = \arg\max_{y'' \in \{+1,-1\}^n,\; y'' \neq y,\; y'' \notin \mathcal{Y}/y}
\left\{ \Delta(y, y'') + \sum_{i=1}^n \sum_{l: y'_l \in \mathcal{Y}/y} \alpha_l (y - y'_l)^\top K_\tau(X, x_i) \, y''_i \right\}, \tag{19}
$$

where $K_\tau(X, x_i) = [K_\tau(x_1, x_i), \cdots, K_\tau(x_n, x_i)]^\top \in \mathbb{R}^{n \times 1}$.
Then we can update $\mathcal{Y}/y$ by adding $y^*$ to it,

$$
\mathcal{Y}/y \leftarrow \{y^*\} \cup \mathcal{Y}/y. \tag{20}
$$
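
One useful observation about the alternating steps is that, with the squared-weight combination $K_\tau = \sum_m \tau_m^2 K_m$, the $\alpha$-dependent part of the dual objective depends on the data only through $M$ scalars $q_m = \sum_{l,k} \alpha_l \alpha_k (y - y'_l)^\top K_m(X, X) (y - y'_k)$. The sketch below checks this identity numerically. The function names, the random positive semidefinite matrices standing in for $K_m(X, X)$, the two-element working set, and the use of the error rate as $\Delta$ are all assumptions made for illustration.

```python
import numpy as np

def kernel_quadratic_terms(kernels, y, working_set, alpha):
    """q_m = v^T K_m v with v = sum_l alpha_l (y - y'_l), one value per kernel."""
    D = np.array([y - v for v in working_set], dtype=float)  # row l is y - y'_l
    v = D.T @ alpha
    return np.array([v @ K @ v for K in kernels])

def dual_objective(q, tau, alpha, losses):
    """-1/2 sum_m tau_m^2 q_m + sum_l alpha_l Delta(y, y'_l)."""
    return -0.5 * np.dot(np.asarray(tau) ** 2, q) + np.dot(alpha, losses)

rng = np.random.default_rng(0)
n, M = 6, 3
kernels = []
for _ in range(M):
    A = rng.normal(size=(n, n))
    kernels.append(A @ A.T)               # random PSD stand-ins for K_m(X, X)

y = np.array([1, 1, 1, -1, -1, -1])
working_set = [-y, np.array([1, -1, 1, -1, 1, -1])]
alpha = np.array([0.3, 0.1])
losses = np.array([1.0, 2 / 6])           # error rates of the two tuples
tau = np.array([0.5, 0.3, 0.2])

q = kernel_quadratic_terms(kernels, y, working_set, alpha)
value = dual_objective(q, tau, alpha, losses)

# Same quantity computed through the combined kernel matrix.
K_tau = sum(t ** 2 * K for t, K in zip(tau, kernels))
D = np.array([y - v for v in working_set], dtype=float)
direct = -0.5 * alpha @ (D @ K_tau @ D.T) @ alpha + alpha @ losses
assert np.isclose(value, direct)
```

Once $\alpha$ is fixed, the $q_m$ values can be computed once and reused for every candidate $\tau$, so the $\tau$-step never needs to touch the $n \times n$ kernel matrices again.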


C. Algorithm 

The iterative multi-kernel learning algorithm to optimize
multivariate performance measures is summarized in Algorithm 1.

Algorithm 1 Multi-Kernel Learning algorithm for multivariate
Performance measure Optimization (MKLPO).
Input: Training sample feature matrix $X$, and corresponding
class label tuple $y$;

Initialize $\alpha^0$ and $\tau^0$;

Initialize $\mathcal{Y}/y = \emptyset$;
for $t = 1, \cdots, T$ do

    Obtain a predicted class label tuple $y^*$ as in (19) by fixing
    $\alpha^{t-1}$ and $\tau^{t-1}$, and add it to $\mathcal{Y}/y$ as in (20);

    Update $\alpha^t$ by solving (17) and fixing $\tau^{t-1}$;

    Update $\tau^t$ by solving (18) and fixing $\alpha^t$;

end for

Output: The learned $\alpha^T$ and $\tau^T$.
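
The working-set and $\alpha$ steps of Algorithm 1 can be sketched compactly for a toy problem. The code below is an illustration under several loudly stated assumptions, not the implementation used in the experiments: $\tau$ is held fixed (the $\tau$-step (18) is omitted), the error rate is used as $\Delta$, step (19) is done by exhaustive enumeration (feasible only for very small $n$), and (17) is solved by projected gradient ascent instead of a dedicated QP solver. All function names are hypothetical.

```python
import itertools
import numpy as np

def error_rate(y, y_pred):
    """Delta(y, y'): fraction of mismatched labels (an assumed choice of loss)."""
    return float(np.mean(y != y_pred))

def most_violated(y, scores, loss, working_set):
    """Step (19) by exhaustive search over tuples not equal to y or already kept."""
    seen = {tuple(int(v) for v in t) for t in working_set}
    seen.add(tuple(int(v) for v in y))
    best, best_val = None, -np.inf
    for cand in itertools.product([1, -1], repeat=len(y)):
        if cand in seen:
            continue
        c = np.array(cand)
        val = loss(y, c) + float(scores @ c)
        if val > best_val:
            best, best_val = c, val
    return best

def project(a, C):
    """Euclidean projection onto {a >= 0, sum(a) <= C}."""
    clipped = np.maximum(a, 0.0)
    if clipped.sum() <= C:
        return clipped
    u = np.sort(a)[::-1]                  # sort-based simplex projection
    css = np.cumsum(u) - C
    k = np.arange(1, a.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(a - css[rho] / (rho + 1), 0.0)

def solve_alpha(Q, delta, C, iters=5000):
    """Projected gradient ascent on (17): max -1/2 a^T Q a + delta^T a."""
    step = 1.0 / max(np.linalg.eigvalsh(Q)[-1], 1e-12)
    a = np.zeros(len(delta))
    for _ in range(iters):
        a = project(a + step * (delta - Q @ a), C)
    return a

def mklpo_fixed_tau(kernels, y, tau, C=10.0, T=4, loss=error_rate):
    """Cutting-plane loop with tau fixed; returns alpha, working set, predictions."""
    K = sum(t ** 2 * Km for t, Km in zip(tau, kernels))
    working_set, alpha = [], np.zeros(0)
    for _ in range(T):
        D = np.array([y - v for v in working_set], dtype=float)
        D = D.reshape(len(working_set), len(y))
        scores = K @ (D.T @ alpha)        # w^T phi_tau(x_i) in kernel form
        working_set.append(most_violated(y, scores, loss, working_set))
        D = np.array([y - v for v in working_set], dtype=float)
        delta = np.array([loss(y, v) for v in working_set])
        alpha = solve_alpha(D @ K @ D.T, delta, C)
    return alpha, working_set, np.sign(K @ (D.T @ alpha))

x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])   # toy 1-D samples
y = np.array([-1, -1, -1, 1, 1, 1])
kernels = [np.outer(x, x), np.exp(-(x[:, None] - x[None, :]) ** 2)]
alpha, working_set, y_hat = mklpo_fixed_tau(kernels, y, tau=[0.5, 0.5])
```

A full implementation would add the $\tau$-step after each $\alpha$ update and replace the enumeration by a loss-specific search, since the enumeration grows as $2^n$.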


III. Experiments 
A. Experiment I: Allergen prediction 

In the first experiment, we apply the proposed method to the
problem of allergen prediction and optimize various prediction
performance measures [7].

1) Dataset and protocol: In this experiment, we used a
dataset constructed by Dang and Lawrence [8]. This dataset
contains 42,977 protein sequences, of which 3,907 are allergens
and the remaining 39,070 are non-allergens. To extract
features from each protein sequence, we used the bag-of-words
method [9]. First, the amino acid sequence of a protein
is broken into overlapping peptides with a small sliding
window, and each peptide is treated as a word. To conduct the
experiment, we perform the popular 10-fold cross validation.
Various performance measures are considered in this experiment.
The multivariate performance measures, including AUC,
RP-BEP, ACC, F1 score and MCC, are optimized
on the training set and tested on the test set.
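
The sliding-window step of the bag-of-words representation can be illustrated with a short sketch. The window length `k = 3`, the function names, and the two toy sequences are assumptions chosen for illustration; the window size and vocabulary construction used in the experiment are not specified here.

```python
from collections import Counter

def peptide_words(sequence, k=3):
    """Overlapping length-k peptides from a sliding window with stride 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def bag_of_words(sequences, k=3):
    """Return (vocabulary, count vectors) for a list of amino acid sequences."""
    counts = [Counter(peptide_words(s, k)) for s in sequences]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, vectors

vocab, vectors = bag_of_words(["MKVLA", "KVLKV"], k=3)
# vocab   -> ['KVL', 'LKV', 'MKV', 'VLA', 'VLK']
# vectors -> [[1, 0, 1, 1, 0], [1, 1, 0, 0, 1]]
```

Each protein is thus turned into a fixed-length count vector over the shared peptide vocabulary, which can serve as a column of the feature matrix $X$.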

2) Results: We compare the proposed multi-kernel learning
based multivariate performance measure optimization algorithm
against the original kernel version of SVM$^{Perf}$, the
cutting-plane subspace pursuit (CPSP) algorithm [5]. Moreover, three
different variations of SVM$^{Perf}$ are also compared as
state-of-the-art multivariate performance measure optimization
methods, including the performance measure optimization
method by classifier adaptation (CAPO) [10], the feature selection
method for multivariate performance measure optimization
(FSPO) [11], and the non-decomposable loss function
optimization method (NDLO) [12]. We used these methods
to optimize the AUC of the ROC curve, the RP-BEP of the
recall-precision curve, ACC, F1 score, and MCC,
respectively, on the training set, and then tested them on the test
set. The boxplots of the corresponding performance measures
over the 10-fold cross validation are given in Figure 1. From
this figure, we can see clearly that the proposed multi-kernel
based multivariate performance measure optimization method
achieves the best results with regard to the different performance
measures. A similar phenomenon can be observed in Figure
1(e), where MKLPO is the only algorithm that obtains a
median MCC value higher than 0.900. For the other performance
measures, MKLPO also achieves the best
performance on the test sets. Among the compared
algorithms, both CPSP and CAPO are improved by using
the kernel trick. However, due to the limitation of a single kernel,
their performance is not necessarily superior to that of the linear
models, FSPO and NDLO. In most cases, their performances
are comparable to each other.

Fig. 1. Boxplots of optimized multivariate performance measures of 10-fold
cross validations of allergen prediction problem.

B. Experiment II: Rehabilitative speech treatment assessment 

In this experiment, we test the proposed algorithm for the 
automatic assessment of rehabilitative speech treatment. 

1) Dataset and protocol: In this experiment, we use the
dataset provided by Tsanas et al. [13]. There are 126 phonations
in the data set. A speech expert assessed the
phonations and labeled each of them as “acceptable” or “unacceptable”.
Among the 126 phonations, 42 are labeled as “acceptable”
while the remaining 84 are labeled as “unacceptable”. Each
phonation is treated as a data sample in the pattern
classification problem, with “acceptable” phonations defined as positive
samples and “unacceptable” phonations as negative samples.
For the purpose of pattern classification, we extract features
from each of the phonations. To conduct the experiment,
we also use 10-fold cross validation. The multivariate
performance measures, including AUC, RP-BEP, ACC, F1 score
and MCC, are optimized on the training set and
tested on the test set.
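
Three of the measures listed above (ACC, F1 score, and MCC) are functions of the whole contingency table rather than sums of per-sample terms, which is what makes them multivariate. The sketch below computes them from scratch using their standard definitions; the function names and the toy label vectors are assumptions for illustration, and returning 0 when a denominator vanishes is a convention chosen here.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn, tn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]
counts = confusion_counts(y_true, y_pred)      # (tp, fp, fn, tn) = (2, 1, 1, 4)
# accuracy -> 0.75, F1 -> 2/3, MCC -> 7/15
```

On imbalanced data such as this experiment (42 positive versus 84 negative samples), accuracy alone can look favorable while F1 and MCC reveal weaker performance on the positive class.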
Fig. 2. Boxplots of optimized multivariate performance measures of 10-fold
cross validations of rehabilitative speech treatment assessment problem.


2) Results: Fig. 2 shows the boxplots of the optimized
multivariate performance measures over the 10-fold cross validation
on the rehabilitative speech treatment assessment data set. As
can be seen, our MKLPO algorithm clearly outperforms
the other multivariate performance measure optimization
algorithms in most cases. The performance difference is larger when
MCC is optimized as the desired multivariate performance
measure. Besides the proposed MKLPO algorithm, the CAPO
algorithm slightly outperforms the other algorithms in most cases.
This result is consistent with the experimental results given in
the previous section.

IV. Conclusions and future works 

Recently, a multivariate performance measure optimization
method was proposed to estimate a given complex multivariate
performance measure as a linear function. This method is
based on the kernel trick. However, it is difficult to choose a
suitable kernel function and its corresponding parameter.
To solve this problem, in this paper we proposed the first
multi-kernel learning based algorithm for the
optimization of multivariate performance measures. We build
a unified objective function for learning both the multiple
kernel weights and the classifier parameter for the purpose of
multivariate performance measure optimization. An iterative
algorithm is developed to optimize the objective function. The
experimental results on two different pattern classification problems
show that the proposed algorithm outperforms the state-of-the-art
multivariate performance measure optimization methods. In the
future, we will also explore the potential of applying the proposed
method to bioinformatics problems [14]-[22], integrated circuit
design [23]-[32], multiple model big data analysis [33]-[40],
software and network security [41]-[50], and power systems
optimization [51], [52]. Moreover, we will also improve the
proposed method by regularizing the learning of the classifier
by graphs [53]-[61].